Software Fault Tolerance in Computer Operating Systems
Abstract
This chapter provides data and analysis of the dependability and fault tolerance of three operating systems: the Tandem/GUARDIAN fault-tolerant system, the VAX/VMS distributed system, and the IBM/MVS system. Based on measurements from these systems, basic software error characteristics are investigated. Fault tolerance in operating systems resulting from the use of process pairs and recovery routines is evaluated. Two levels of models are developed to analyze error and recovery processes inside an operating system and interactions among multiple instances of an operating system running in a distributed environment. The measurements show that the use of process pairs in Tandem systems, which was originally intended for tolerating hardware faults, allows the system to tolerate about 70% of defects in system software that result in processor failures. The loose coupling between processors, which results in the backup execution (the processor state and the sequence of events occurring) being different from the original execution, is a major reason for the measured software fault tolerance. The IBM/MVS system fault tolerance almost doubles when recovery routines are provided, in comparison to the case in which no recovery routines are available. However, even when recovery routines are provided, there is almost a 50% chance of system failure when critical system jobs are involved.
Similar resources
Proposing an Efficient Software-based Method to Enhance Reliability of Computer Systems against Soft Errors
In recent years, along with rapid developments in technology, computer systems have increasingly become more integrated and more modular. Indeed, the reliability and efficiency of computer systems are of high significance. Hence, the quantitative evaluation of the optimization of reliability indexes in computer systems is considered to be a crucial issue. Reliability enhancement of computer systems...
An Overview of Software Fault Tolerant Computing
Fault-tolerance is the survival attribute of the real-time software system. The software should provide correct results in the face of various failures. A major technological concern for the coming years is the ever widening gap between the demand for high quality, robust software and its supply. Software fault tolerance is basically the design faults in the computer system. Its function is to ...
Application-layer Fault-Tolerance Protocols
The central topic of this book is application-level fault-tolerance, that is the methods, architectures, and tools that allow to express a fault-tolerant system in the application software of our computers. Application-level fault-tolerance is a sub-class of software fault-tolerance that focuses on the problems of expressing the problems and solutions of fault-tolerance in the top layer of the ...
High-Coverage Fault Tolerance in Real-Time Systems Based on Point-to-Point Communication
The distributed recovery block (DRB) scheme is a widely applicable approach for realizing both hardware and software fault tolerance in real-time distributed and parallel computer systems. One of the most important extensions of the DRB scheme which were outlined in recent years but not developed fully is the integration of the DRB scheme and a network surveillance (NS) scheme. We recently deve...
An Integration of the Primary-Shadow TMO Replication Scheme with a Supervisor-Based Network Surveillance Scheme and Its Recovery Time Bound Analysis
The time-triggered message-triggered object (TMO) scheme was formulated a few years ago as a major extension of the conventional object structuring schemes with the idealistic goal of facilitating general-form design and timeliness-guaranteed design of complex real-time application systems. Recently, as a new scheme for realizing TMO-structured distributed and parallel computer systems capable ...